A Framework for Frequent Sequence Mining under Generalized Regular Expression Constraints
Abstract
This paper provides a framework for the extraction of frequent sequences satisfying a given regular expression (RE) constraint. We take advantage of the information contained in the hierarchical representation of an RE as an abstract syntax tree (AST). Interestingly, pruning can be based not only on the anti-monotonicity of the minimal frequency constraint but also on the RE constraint, even though the latter is generally not anti-monotonic. The AST representation makes it possible to examine the decomposition of the RE and to choose dynamically an adequate extraction method according to the local selectivity of the sub-REs. Our algorithm, RE-Hackle, explores only the candidate space spanned by the regular expression, and prunes it at each level. Owing to the dynamic choice of the exploration method, this algorithm outperforms its predecessors. We provide an experimental validation on both synthetic data and a real genomic sequence database. Furthermore, we show how this framework can be extended to regular expressions with variables, which provide a context-sensitive specification of the desired sequences.
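The general idea of exploring only the candidates spanned by the RE and pruning at each level of its AST can be illustrated with a minimal Python sketch. This is not RE-Hackle itself: the node classes (`Lit`, `Concat`, `Alt`, `Star`), the function names, and the choice of subsequence containment as the support measure are all assumptions made for illustration, and the sketch uses a single bottom-up strategy rather than a dynamic choice of extraction method. It relies only on the anti-monotonicity of the frequency constraint: a sequence that contains an infrequent piece cannot itself be frequent, so infrequent candidates can be dropped at every AST node.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union

Seq = Tuple[str, ...]

# Hypothetical AST node types for a regular expression (illustrative only).
@dataclass(frozen=True)
class Lit:
    symbol: str

@dataclass(frozen=True)
class Concat:
    left: "Node"
    right: "Node"

@dataclass(frozen=True)
class Alt:
    left: "Node"
    right: "Node"

@dataclass(frozen=True)
class Star:
    child: "Node"

Node = Union[Lit, Concat, Alt, Star]

def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern: Seq, database: List[Seq]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)

def frequent(cands: Set[Seq], database: List[Seq], min_support: int) -> Set[Seq]:
    return {c for c in cands if support(c, database) >= min_support}

def mine(node: Node, database: List[Seq], min_support: int) -> Set[Seq]:
    """Frequent sequences matching the sub-RE rooted at node.

    Candidates are built bottom-up from the AST and pruned by frequency
    at every node, so infrequent pieces are never extended.
    """
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    if isinstance(node, Lit):
        cands = {(node.symbol,)}
    elif isinstance(node, Alt):
        cands = (mine(node.left, database, min_support)
                 | mine(node.right, database, min_support))
    elif isinstance(node, Concat):
        left = mine(node.left, database, min_support)
        right = mine(node.right, database, min_support)
        cands = {l + r for l in left for r in right}
    else:  # Star: repeat the child until no new frequent sequence appears
        base = mine(node.child, database, min_support) - {()}
        result: Set[Seq] = {()}
        frontier: Set[Seq] = {()}
        while frontier:
            grown = {f + b for f in frontier for b in base}
            frontier = frequent(grown, database, min_support) - result
            result |= frontier
        return frequent(result, database, min_support)
    return frequent(cands, database, min_support)

# Toy data and the constraint a b* c, both invented for this example.
database = [("a", "b", "c"), ("a", "c"), ("a", "b", "b", "c"), ("b", "c")]
ast = Concat(Lit("a"), Concat(Star(Lit("b")), Lit("c")))
result = mine(ast, database, min_support=2)
print(sorted(result))
```

In this toy run the repetition under the star stops after one step, because `("b", "b")` occurs in only one sequence and is pruned before it can be combined with the rest of the expression. The loop for `Star` terminates because any candidate longer than the longest database sequence has support zero.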
Similar resources
Methods for Frequent Sequence Mining with Subsequence Constraints
In this thesis, we study scalable and general purpose methods for mining frequent sequences that satisfy a given subsequence constraint. Frequent sequence mining is a fundamental task in data mining and has many real-life applications like information extraction, market-basket analysis, web usage mining, or session analysis. Depending on the underlying application, we are generally interested i...
Mining frequent sequential patterns under regular expressions: a highly adaptative strategy for pushing constraints
This paper introduces a new framework for the extraction of frequent sequences satisfying a given regular expression (RE) constraint. Contrary to previous work (SPIRIT algorithms), we represent REs by tree structures and our algorithm can choose dynamically an extraction method according to the local selectivity of the sub-REs. Interestingly, pruning can rely not only on the anti-monotonic mini...
Sequential Pattern Mining Using Formal Language Tools
In the present scenario, almost every system and process is computerized, and hence all information and data are stored in computers. Huge collections of data are emerging. Retrieving untouched, hidden and important information from this huge volume of data is quite tedious work. Data mining is a technological solution that extracts untouched, hidden and important information from vast databases...
A new method for finding generalized frequent itemsets in generalized association rule mining
Generalized association rule mining is an extension of traditional association rule mining to discover more informative rules, given a taxonomy. In this paper, we describe a formal framework for the problem of mining generalized association rules. In the framework, the subset-superset and the parent-child relationships among generalized itemsets are introduced to present the different views of ...
SPIRIT: Sequential Pattern Mining with Regular Expression Constraints
Discovering sequential patterns is an important problem in data mining with a host of application domains including medicine, telecommunications, and the World Wide Web. Conventional mining systems provide users with only a very restricted mechanism (based on minimum support) for specifying patterns of interest. In this paper, we propose the use of Regular Expressions (REs) as a flexible constr...